Assignment instructions:
Author's note: we will do the EDA before the modelling.
from utils import *
'''The above "utils.py" is a custom package I created with a colleague during an ML course. It contains
functions to optimize data types, fill missing data according to the data type, and perform
one-hot encoding on categorical variables with fewer unique values than a user-defined threshold.'''
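As a rough illustration of the kind of helper `utils.py` contains, a threshold-based one-hot encoder might look like this (a sketch only; the function name and signature here are hypothetical, not the actual `utils.py` API):

```python
import pandas as pd

def one_hot_below_threshold(df: pd.DataFrame, max_unique: int = 10) -> pd.DataFrame:
    """Hypothetical sketch: one-hot encode only the object/categorical
    columns with at most `max_unique` distinct values; leave the rest as-is."""
    cols = [c for c in df.select_dtypes(include=["object", "category"]).columns
            if df[c].nunique() <= max_unique]
    return pd.get_dummies(df, columns=cols)
```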
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import scipy.stats as stats
from scipy.stats import kurtosis
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
Here we load the dataset and examine the distributions more closely. We will not yet add the second variable to the visualization.
df = pd.read_csv("beer_reviews.csv")
df.shape # over 1 million rows. This is pretty computationally heavy, so we may reduce to a smaller sample when necessary.
(1586614, 13)
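If the full 1.58M rows prove too computationally heavy, a reproducible subsample can be drawn as follows (a sketch; the 10% fraction is an arbitrary choice, and the toy frame stands in for the real `beer_reviews` DataFrame):

```python
import numpy as np
import pandas as pd

# Toy stand-in; in the notebook this would be the beer_reviews DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"review_overall": rng.choice([3.0, 3.5, 4.0], size=1000)})

# Draw a reproducible 10% sample; random_state fixes the rows selected.
df_small = df.sample(frac=0.10, random_state=42)
```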
df.head()
| | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10325 | Vecchio Birraio | 1234817823 | 1.5 | 2.0 | 2.5 | stcules | Hefeweizen | 1.5 | 1.5 | Sausa Weizen | 5.0 | 47986 |
| 1 | 10325 | Vecchio Birraio | 1235915097 | 3.0 | 2.5 | 3.0 | stcules | English Strong Ale | 3.0 | 3.0 | Red Moon | 6.2 | 48213 |
| 2 | 10325 | Vecchio Birraio | 1235916604 | 3.0 | 2.5 | 3.0 | stcules | Foreign / Export Stout | 3.0 | 3.0 | Black Horse Black Beer | 6.5 | 48215 |
| 3 | 10325 | Vecchio Birraio | 1234725145 | 3.0 | 3.0 | 3.5 | stcules | German Pilsener | 2.5 | 3.0 | Sausa Pils | 5.0 | 47969 |
| 4 | 1075 | Caldera Brewing Company | 1293735206 | 4.0 | 4.5 | 4.0 | johnmichaelsen | American Double / Imperial IPA | 4.0 | 4.5 | Cauldron DIPA | 7.7 | 64883 |
Based on intuition, we hypothesize that taste and aroma review scores will be among the best predictors of the overall review score, since taste and smell are the two most immediately apparent features of a beer.
taste_df = df.groupby("review_overall")["review_taste"].mean().to_frame().reset_index()
taste_df # there's no data for the 0.5, so we'll create a row for the visualization with missing data
| | review_overall | review_taste |
|---|---|---|
| 0 | 0.0 | 2.142857 |
| 1 | 1.0 | 1.347909 |
| 2 | 1.5 | 1.738189 |
| 3 | 2.0 | 2.214519 |
| 4 | 2.5 | 2.671565 |
| 5 | 3.0 | 3.135203 |
| 6 | 3.5 | 3.586405 |
| 7 | 4.0 | 3.986737 |
| 8 | 4.5 | 4.304028 |
| 9 | 5.0 | 4.579720 |
# DataFrame.append was removed in pandas 2.0, and a missing value should be NaN rather than ''
myrow = pd.DataFrame({"review_overall": [0.5], "review_taste": [np.nan]})
taste_df = pd.concat([taste_df, myrow], ignore_index=True)
taste_df = taste_df.sort_values("review_overall").reset_index(drop=True)
taste_df
| | review_overall | review_taste |
|---|---|---|
| 0 | 0.0 | 2.142857 |
| 1 | 0.5 | |
| 2 | 1.0 | 1.347909 |
| 3 | 1.5 | 1.738189 |
| 4 | 2.0 | 2.214519 |
| 5 | 2.5 | 2.671565 |
| 6 | 3.0 | 3.135203 |
| 7 | 3.5 | 3.586405 |
| 8 | 4.0 | 3.986737 |
| 9 | 4.5 | 4.304028 |
| 10 | 5.0 | 4.57972 |
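An alternative to appending the missing 0.5 row by hand is to reindex the grouped means against the full half-point grid, which fills absent levels with NaN automatically (a sketch on a toy frame, assuming overall scores only take half-point steps from 0 to 5):

```python
import numpy as np
import pandas as pd

# Toy version of the grouped means, missing the 0.5 level as in the data above.
taste_df = pd.DataFrame({
    "review_overall": [0.0, 1.0, 1.5],
    "review_taste": [2.14, 1.35, 1.74],
})

# Reindex onto every half-point level; absent levels become NaN rows.
full_grid = np.arange(0, 5.5, 0.5)
taste_full = (taste_df.set_index("review_overall")
                      .reindex(full_grid)
                      .rename_axis("review_overall")
                      .reset_index())
```

This also keeps the x-axis of the later plot aligned with `np.arange(0, 5.5, 0.5)` by construction.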
# Let's look at the relationship between review_overall (the outcome we are trying to predict) and review_taste
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Histogram(x=df["review_overall"], showlegend=False, name="Count of Overall Scores"), secondary_y=False)
fig.add_trace(go.Scatter(x=np.arange(0,5.5,0.5), y=taste_df["review_taste"], name="Avg. Taste Scores"), secondary_y=True)
fig.add_vline(x=df["review_overall"].median(),
line_dash = 'dash',
line_color = 'firebrick',
annotation_text="Median",
annotation_position="top right")
fig.add_vline(x=df["review_overall"].mean(),
line_dash = 'longdash',
line_color = 'deepskyblue',
annotation_text="Mean",
annotation_position="top left")
fig.update_layout(
title={
'text': "Distribution of Overall Review Scores and Average Taste Scores",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
showlegend=True)
fig.update_layout(yaxis2 = dict(range=[0, 5]))
fig.update_yaxes(title_text="Bin Counts", secondary_y=False)
fig.update_yaxes(title_text="Avg Taste Scores", secondary_y=True)
fig.show()
The relationship between overall review score and taste review score is as expected and seems strong. The higher the overall review score, the higher the average taste score.
aroma_df = df.groupby("review_overall")["review_aroma"].mean().to_frame().reset_index()
aroma_df # there's no data for the 0.5, so we'll create a row for the visualization
| | review_overall | review_aroma |
|---|---|---|
| 0 | 0.0 | 2.571429 |
| 1 | 1.0 | 1.906153 |
| 2 | 1.5 | 2.192794 |
| 3 | 2.0 | 2.600576 |
| 4 | 2.5 | 2.918690 |
| 5 | 3.0 | 3.234584 |
| 6 | 3.5 | 3.583218 |
| 7 | 4.0 | 3.874856 |
| 8 | 4.5 | 4.120822 |
| 9 | 5.0 | 4.328959 |
# Same fix as above: use pd.concat (append was removed in pandas 2.0) and NaN for the missing value
myrow = pd.DataFrame({"review_overall": [0.5], "review_aroma": [np.nan]})
aroma_df = pd.concat([aroma_df, myrow], ignore_index=True)
aroma_df = aroma_df.sort_values("review_overall").reset_index(drop=True)
aroma_df
| | review_overall | review_aroma |
|---|---|---|
| 0 | 0.0 | 2.571429 |
| 1 | 0.5 | |
| 2 | 1.0 | 1.906153 |
| 3 | 1.5 | 2.192794 |
| 4 | 2.0 | 2.600576 |
| 5 | 2.5 | 2.91869 |
| 6 | 3.0 | 3.234584 |
| 7 | 3.5 | 3.583218 |
| 8 | 4.0 | 3.874856 |
| 9 | 4.5 | 4.120822 |
| 10 | 5.0 | 4.328959 |
# Let's look at the relationship between review_overall and review_aroma
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Histogram(x=df["review_overall"], showlegend=False, name="Count of Overall Scores"), secondary_y=False)
fig.add_trace(go.Scatter(x=np.arange(0,5.5,0.5), y=aroma_df["review_aroma"], name="Avg. Aroma Scores"), secondary_y=True)
fig.add_vline(x=df["review_overall"].median(),
line_dash = 'dash',
line_color = 'firebrick',
annotation_text="Median",
annotation_position="top right")
fig.add_vline(x=df["review_overall"].mean(),
line_dash = 'longdash',
line_color = 'deepskyblue',
annotation_text="Mean",
annotation_position="top left")
fig.update_layout(
title={
'text': "Distribution of Overall Review Scores and Average Aroma Scores",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
showlegend=True)
fig.update_layout(yaxis2 = dict(range=[0, 5]))
fig.update_yaxes(title_text="Bin Counts", secondary_y=False)
fig.update_yaxes(title_text="Avg Aroma Scores (/5)", secondary_y=True)
fig.show()
The relationship between overall review score and aroma score is as expected and fairly strong. The higher the overall review score, the higher the average aroma score.
Compared with taste, however, aroma appears slightly less strongly related to the overall score: average aroma spans a range of 4.3 - 1.9 = 2.4 across overall scores from 1 to 5, whereas average taste spans 4.6 - 1.3 = 3.3 over the same range.
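The eyeballed comparison can be quantified with Pearson correlations between the raw columns (a sketch; the toy frame below stands in for the full dataset, where you would call `.corr` on `df` directly):

```python
import pandas as pd

# Toy stand-in mirroring the group-mean endpoints seen above.
toy = pd.DataFrame({
    "review_overall": [1.0, 2.0, 3.0, 4.0, 5.0],
    "review_taste":   [1.3, 2.2, 3.1, 4.0, 4.6],
    "review_aroma":   [1.9, 2.6, 3.2, 3.9, 4.3],
})

# Pearson correlation of each candidate predictor with the outcome.
r_taste = toy["review_overall"].corr(toy["review_taste"])
r_aroma = toy["review_overall"].corr(toy["review_aroma"])
```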
# Train - Validation - Test Split with the 80% - 10% - 10% convention we have agreed upon.
# We start with review_taste and review_aroma as predictors, based on the hypothesis that taste and aroma
# are the most important factors in predicting overall appreciation of a beer.
X, y = df[["review_taste", "review_aroma"]], df["review_overall"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Verification: it worked as intended. We now have an 80% - 20% split.
X_train.shape[0]/X.shape[0], y_train.shape[0]/y.shape[0], X_test.shape[0]/X.shape[0], y_test.shape[0]/y.shape[0]
(0.7999998739453956, 0.7999998739453956, 0.20000012605460432, 0.20000012605460432)
# Now we split in half the resulting test set to have an 80% - 10% - 10% split. We retrieve the middle point of the index.
X_test.shape[0], y_test.shape[0]
round(X_test.shape[0]/2,0)
158662.0
# Here we manually split the test set. Since it has an odd number of rows, the split will not be perfectly even.
X_val = X_test.iloc[:158663] # we use slicing to grab the first half
X_test = X_test.iloc[158663:] # we use slicing to grab the second half
y_val = y_test.iloc[:158663] # same here
y_test = y_test.iloc[158663:] # same here
# Final verification of the train-validation-test split. Worked as intended.
print(X_train.shape[0]/X.shape[0], y_train.shape[0]/y.shape[0])
print(X_val.shape[0]/X.shape[0], y_val.shape[0]/y.shape[0])
print(X_test.shape[0]/X.shape[0], y_test.shape[0]/y.shape[0])
0.7999998739453956 0.7999998739453956 0.10000100843683467 0.10000100843683467 0.09999911761776967 0.09999911761776967
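The same 80/10/10 split can be obtained without manual slicing by calling `train_test_split` twice, which also shuffles the second split rather than cutting it at a fixed index (a sketch on toy data; `random_state=42` matches the notebook's seed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"review_taste": rng.random(1000), "review_aroma": rng.random(1000)})
y = pd.Series(rng.random(1000), name="review_overall")

# First split: 80% train, 20% held out.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=42)
# Second split: halve the hold-out into validation and test (10% each overall).
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)
```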
Now let's use `review_taste` and `review_aroma` to predict `review_overall`. We surmised that taste and aroma are the most important factors in predicting overall appreciation of a beer.
X_train.shape[0], X_val.shape[0], X.shape[0], y_val.shape[0]
(1269291, 158663, 1586614, 158663)
X_val.shape[0], y_val.shape[0], y_pred.shape[0]  # note: this check was run after y_pred is created by the cell below
(158663, 158663, 158663)
# Linear regression with two predictors, using scikit-learn's LinearRegression. We evaluate on the validation set.
# Creating a linear regression object
regr = LinearRegression()
# Fitting ("training") the model
regr.fit(X_train, y_train)
# Making predictions on the validation set
y_pred = regr.predict(X_val)
# The coefficients
print("Coefficients: \n", regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_val, y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination (R-squared): %.2f" % r2_score(y_val, y_pred))
Coefficients:
 [0.70519803 0.10597099]
Mean squared error: 0.19
Coefficient of determination (R-squared): 0.63
With an R-squared of 0.63, the model already explains about 63% of the variance in overall scores using only two predictors. We will improve on this further later on.